In most of the cases you will need to change the format of your dataset because it will not be in the right format that you will need to plot or analyse the data. tidyverse has a package called dplyr which contains functions that help you to select columns/rows, sort, combine and perform other data types of data transformations. In the next sections we will look at different ways to transform our dataset. Now that we already know the basics of visualing data with ggplot we will also learn how to visualise other plots with the transformed dataset as we go along.
To learn more about
dplyrplease look at the following resources:If you find these resources difficult to understand, return to these after completing the course.
Before we move onto the different ways that we can transform our data, we’ll discuss one of the most powerful additions to R, pipes.
Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr.
Let’s say we are interested only in the data that do not contain any missing values. Using pipes, we do the following:
surveys %>%
drop_na()This takes the argument on the left (surveys) and passes it on to the function after it (drop_na, which removes all rows with missing values).
If we want to create a new object with the transformed data we can assign it a new name as below:
surveys_complete <- surveys %>%
drop_na()
surveys_completeAlthough at this point pipes might not seem like a revolutionary invention, they become more powerful when combining multiple operations. This we’ll see next.
To select columns of a data frame or tibble, use theselect function. The first argument is the data frame or tibble you are working on (in our example it is surveys, which we pipe through), and the subsequent arguments are the columns to keep.
# Extract species_id, weight, hindfoot_lenth, year and sex columns from surveys dataset.
surveys %>%
select(species_id, weight, hindfoot_length, year, sex)To select all columns except certain ones, put a - in front of the column to exclude it.
# Select all columns of the surveys dataset apart from record_id and species_id columns.
surveys %>%
select(-record_id, -species_id)To remove rows from a data frame or tibble use the filter function from the dplyr package. The first argument is the data frame or tibble to perform the filtering on (in this case we pipe the data through) and the next arguments are the conditions on which to keep the rows.
To choose rows based on a specific condition, use the filter function as follows:
# Keep only the observations of animals collected from 1995 onwards from the surveys dataset.
surveys %>%
filter(year >= 1995)You can filter on multiple conditions:
# Keep only the observations of animals collected from 1995 onwards
# that are female from the surveys dataset.
surveys %>%
filter(year >=1995,
sex == "F")Note The sex column is a character and thus needs to be quoted, whereas the year column is numerical and does not. Also note that the filter arguments could have been written on a single line, but it is useful to break up your code sometimes to make it more readable.
Challenge 1 - pipes
Subset the
surveys_completedata to keep only thespecies_id,weight,hindfoot_length,yearandsexcolumns and the animals collected on and after 1995. Then plot a scatter plot ofweight(x-axis) againsthindfoot_length(y-axis) using this transformed dataset. Do all the above using pipes, without creating any variables.Answer
surveys_complete %>% # Select columns select(species_id, weight, hindfoot_length, year, sex) %>% # Filter rows filter(year >= 1995) %>% # Plot transformed data ggplot(mapping = aes(x = weight, y = hindfoot_length)) + geom_point()![]()
Challenge 2 - plotting subset with different colour
Plot all the animals in the
surveys_completedataset asweight(x-axis) againsthindfoot_length(y-axis). Use the dataset created above which contains only the animals that were collected on and after 1995 and highlight these points in red in the plot. Hint: create a new object calledsurvey_recentthat contains the dataset created above.Answer
survey_recent <- surveys_complete %>% select(species_id, weight, hindfoot_length, year, sex) %>% filter(year >= 1995) ggplot(mapping = aes(x = weight, y = hindfoot_length)) + geom_point(data = surveys_complete) + geom_point(data = survey_recent, colour = "red", alpha = 0.4)![]()
Note: In the example above we did not specify the
dataargument in theggplotfunction because the arguments in theggplotfunction should be common throughout the whole plot. In this case we specified the respective data as separate layers in 2 differentgeom_pointplots; the first plot is the data with all the animals insurveys_complete(points in black), the second plot is layered on top of the first plot and is a subset of thesurveys_completedataset (points in red).
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use the mutate function.
To create a new column of weight in kg:
surveys_complete %>%
mutate(weight_kg = weight / 1000)You can also create a second new column based on the first new column within the same call of mutate():
surveys_complete %>%
mutate(weight_kg = weight / 1000,
weight_lbs = weight_kg * 2.20462)There are other ways on how to create new columns. Refer to the dplyr cheat sheet Make New Variables section.
Challenge - the boxplot
Using the
surveys_completedataset, create a boxplot for each year on the x-axis and weight in kg on the y-axis. Use the subset dataset that was transformed in the previous challenge, i.e. the one that has the recent observations (from 1995 onwards). As before try to do all the operations using pipes, without creating variables.
Hint: You will need to use theas_factorfunction to convert theyearcolumn tofactortype.Your first attempt would be to use the code we used before to subset the
surveys_completedataset, but also adding themutatefunction to add a newweight_kgcolumn. Further more you will need to also add theggplotfunction to draw a boxplot:Answer
surveys_complete %>% # Add weight_kg column mutate(weight_kg = weight / 1000) %>% # Select columns select(species_id, weight_kg, hindfoot_length, year, sex) %>% # Filter rows filter(year >= 1995) %>% # Plot transformed data ggplot(mapping = aes(x = year, y = weight_kg)) + geom_boxplot()#> Warning: Continuous x aesthetic -- did you forget aes(group=...)?![]()
You might notice that this code produces a warning message about the continuous data type of the x-axis and also creates a box plot with one single box and whisker averaged over all the years on the x-axis, rather than one for each year in the dataset as requested. As shown in the
ggplot2cheat sheet, box plots display a discrete variable on the x-axis and a continuous variable on the y-axis. At the momentyearis treated as a continuous variable since its data type isdouble(you can see this from the tibble). To fix this, we need to convert the data type of theyearcolumn to a categorical one (factor) so that each year can be plotted as a separate box and whisker.The
forcatspackage that is part oftidyversehas theas_factorfunction which does just that; it converts a variable into a factor. See the updated code below, which uses theas_factorfunction:Answer
surveys_complete %>% # Add weight_kg column mutate(weight_kg = weight / 1000) %>% # Select columns select(species_id, weight_kg, hindfoot_length, year, sex) %>% # Filter rows filter(year >= 1995) %>% # Convert the year column to a factor mutate(year = as_factor(year)) %>% # Plot transformed data ggplot(mapping = aes(x = year, y = weight_kg)) + geom_boxplot()![]()
Notice that since we need to keep only observations that have been observed from 1995 onwards, we need to convert the
yearcolumn to afactorafter we filter the rows, as otherwise the row filtering operation will not be possible. This is because the newyearcolumn of typefactordoes not have any ordering in its categories so perfomingyear >= 1995would not be possible.It is also possible to convert the
yearvariable directly from withinggplot, thereby skipping themutatestep from above. We can do this using the updated code below (output is the same as above):surveys_complete %>% # Add weight_kg column mutate(weight_kg = weight / 1000) %>% # Select columns select(species_id, weight_kg, hindfoot_length, year, sex) %>% # Filter rows filter(year >= 1995) %>% # Convert the year variable and plot transformed data ggplot(mapping = aes(x = as_factor(year), y = weight_kg)) + geom_boxplot()
Note on the boxplot
As mentioned the boxplot is used to visualise the distribution of a continuous variable. In the example above we displayed the distribution of
weightover the different years. The figure below illustrates the different components of the boxplot. The bottom line of the box represents the 25th quantile, middle line the 50th quantile (or median) and the top line of the box is the 75th quantile. The whiskers can be thought of as defining the boundary of the confidence intervals of the distribution. The dots outside these whiskers are the outliers.![]()
To sort your data dplyr provides function arrange.
# Sort weight in ascending order (default)
surveys_complete %>%
arrange(weight)To sort your data in descending order you will need to use desc().
# Sort weight in descending order
surveys_complete %>%
arrange(desc(weight))You can sort your dataset based on the values of multiple columns:
# Sort weight in ascending order and hindfoot_length in descending order
surveys_complete %>%
arrange(weight, desc(hindfoot_length))As you can see from the result returned, the animals with the smallest weight are at the top. When there is a tie, i.e., more than one animal has the same weight, the animals are sorted in descending order of hindfoot_length. As you can see, the subset of animals with weight of 4 have been sorted in descending order based on hindfoot_length.
Creating summaries of your data would be a good way to start describing the variable you are working with. Summary statistics are a good example of how one can summarise data. We will not cover details about summary statistics in this course, but we will look at how we can summarise data in R. When working with continuous variables, one of the most popular summary statistic is the mean. If we try to caclulate the mean on weight in the surveys_complete dataset we get:
surveys_complete %>%
mean_weight = mean(weight)#> Error in mean(weight): object 'weight' not found
This is because in dplyr you will need to use the summarise function to be able to create summaries of your data.
The summarise function is used when you want to reduce multiple values in your columns to a single value. If we try to calculate the mean using summarise() this time:
surveys_complete %>%
summarise(mean_weight = mean(weight))#> # A tibble: 1 x 1
#> mean_weight
#> <dbl>
#> 1 41.8
As seen from the result, the mean of the values present in the weight column is 41.8.
countObtaining the frequency of your data is another common way of summarising data. Frequencies are normally calculated when working with discrete variables that have a finite number of values, such as categorical data. In our surveys_complete dataset, let us obtain the frequecies of male and female animals present. We can do this by counting the number of “M” and “F” present in the dataset. To do this use the dplyr function count as follows:
surveys_complete %>%
count(sex)#> # A tibble: 2 x 2
#> sex n
#> <chr> <int>
#> 1 F 14520
#> 2 M 16031
As you can see count has grouped the categories present in the sex column and returned the frequency of each category. If we wanted to count combination of factors, such as sex and species, we would specify the first and the second factor as the arguments of count():
surveys_complete %>%
count(sex, species) #> # A tibble: 41 x 3
#> sex species n
#> <chr> <chr> <int>
#> 1 F albigula 606
#> 2 F baileyi 1593
#> 3 F eremicus 539
#> 4 F flavus 711
#> 5 F fulvescens 55
#> 6 F fulviventer 15
#> 7 F hispidus 91
#> 8 F leucogaster 436
#> 9 F leucopus 16
#> 10 F maniculatus 354
#> # … with 31 more rows
Challenge
- How many animals were observed in each
plot_typesurveyed?Answer
surveys_complete %>% count(plot_type)
- What is the frequency of each species of each sex observed? Sort each species in descending order of freqency.
Answer
surveys_complete %>% count(sex, species) %>% arrange(species, desc(n))
If you would like to plot the distribution of a single continuous variable the frequency will be automatically calculated, so you do not need to use count() to calculate the frequency beforehand. The x-axis is automatically divided into bins and the number of observations of the continuous variable in each bin is shown as a bar in the histogram. In the ggplot2 package a histogram can be plotted using the geom_histogram function.
Let us plot a histogram for the continuous variable weight:
ggplot(surveys_complete, aes(weight)) +
geom_histogram()
You can identify display categories in the histogram by plotting a stacked histogram which will show categories for each group stacked on top of each other. This is done by using the fill argument in the aesthetic function. If we want to display sex in our weight histogram:
ggplot(surveys_complete, aes(weight, fill = sex)) +
geom_histogram(bins = 100)
Note that the default number of bins in a histogram is 30. To get have a granular display you can increase the number of bins by using the argument bins in the geom_histogram function as above.
There are other plots that can be used for a single continuous variable (see ONE VARIABLE continuous section on ggplot2 cheat sheet).
In the examples above we learnt how to summarise data over all observations, e.g., we calculated the mean over all observations using the summarise function. However, in data analysis, especially when dealing with big data, a common approach to data exploration is the split-apply-combine strategy. The idea behind this strategy is to split the data into more managable pieces, apply any operations required on the data independently on each piece and then combine the results together. The figure below illustrates the approach that is done in the split-apply-combine approach.
Let’s work on an example on how we can apply the split-apply-combine strategy on the surveys_complete dataset. We would like to split the data by the different categories present in the sex column and calculate the mean weight for each category. We can do this easily by using the group_by function in the dplyr package:
surveys_complete %>%
group_by(sex) %>%
summarise(mean_weight = mean(weight))#> # A tibble: 2 x 2
#> sex mean_weight
#> <chr> <dbl>
#> 1 F 41.5
#> 2 M 42.1
You can also group by multiple columns:
surveys_complete %>%
group_by(sex, species_id) %>%
summarise(mean_weight = mean(weight))#> # A tibble: 46 x 3
#> # Groups: sex [2]
#> sex species_id mean_weight
#> <chr> <chr> <dbl>
#> 1 F BA 9.16
#> 2 F DM 41.6
#> 3 F DO 48.5
#> 4 F DS 117.
#> 5 F NL 154.
#> 6 F OL 30.8
#> 7 F OT 24.8
#> 8 F OX 22
#> 9 F PB 30.2
#> 10 F PE 22.8
#> # … with 36 more rows
Once the data are grouped, you can also summarise multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:
surveys_complete %>%
group_by(sex, species_id) %>%
summarise(mean_weight = mean(weight),
min_weight = min(weight))#> # A tibble: 46 x 4
#> # Groups: sex [2]
#> sex species_id mean_weight min_weight
#> <chr> <chr> <dbl> <dbl>
#> 1 F BA 9.16 6
#> 2 F DM 41.6 10
#> 3 F DO 48.5 12
#> 4 F DS 117. 45
#> 5 F NL 154. 32
#> 6 F OL 30.8 10
#> 7 F OT 24.8 5
#> 8 F OX 22 22
#> 9 F PB 30.2 12
#> 10 F PE 22.8 11
#> # … with 36 more rows
Challenge
- Use
group_by()andsummarise()to find the mean, min, and max hindfoot length for each species (usingspecies_id). Also add the number of observations in a column calledn_obs(hint: see?n).Answer
surveys_complete %>% group_by(species_id) %>% summarise( mean_hindfoot_length = mean(hindfoot_length), min_hindfoot_length = min(hindfoot_length), max_hindfoot_length = max(hindfoot_length), n_obs = n() )#> # A tibble: 24 x 5 #> species_id mean_hindfoot_length min_hindfoot_length max_hindfoot_length n_obs #> <chr> <dbl> <dbl> <dbl> <int> #> 1 BA 13 6 16 45 #> 2 DM 36.0 16 50 9702 #> 3 DO 35.6 26 64 2781 #> 4 DS 50.0 39 58 2023 #> 5 NL 32.2 21 42 1045 #> 6 OL 20.5 12 39 905 #> 7 OT 20.3 13 50 2062 #> 8 OX 20.4 19 21 5 #> 9 PB 26.1 2 47 2762 #> 10 PE 20.2 11 30 1197 #> # … with 14 more rows
- What was the heaviest animal measured in each year? Return the columns
year,genus,species, andweight.Answer
surveys_complete %>% group_by(year) %>% filter(weight == max(weight)) %>% select(year, genus, species_id, weight) %>% arrange(year)#> # A tibble: 28 x 4 #> # Groups: year [26] #> year genus species_id weight #> <dbl> <chr> <chr> <dbl> #> 1 1977 Dipodomys DS 149 #> 2 1978 Neotoma NL 232 #> 3 1978 Neotoma NL 232 #> 4 1979 Neotoma NL 274 #> 5 1980 Neotoma NL 241 #> 6 1980 Neotoma NL 241 #> 7 1981 Neotoma NL 251 #> 8 1982 Neotoma NL 252 #> 9 1983 Neotoma NL 256 #> 10 1984 Neotoma NL 259 #> # … with 18 more rows
The ggplot2 package has a way of creating different plots based on the different categories in the data. This is known as facetting. With facetting we do not need to use group_by() to split the data into different groups to be able to plot the different categories in different plots as ggplot2 does this automatically.
There are two types of facet functions:
facet_wrap() arranges the different plots into muliple rows and columns to cleanly fit on one page.facet_grid() plots all the categories in 1 row or 1 column.Let us see this in action.
surveys_complete %>%
group_by(year, genus) %>%
count() %>%
ggplot(mapping = aes(x = year, y = n)) +
geom_line() +
facet_wrap(facets = vars(genus))
As you can see, each genus has been plotted as a separate plot. It is now clear which are the genera that were observed the most. Another advantage of facetting is that it uses a common axes and all plots are aligned to the same values on the axes, making the different plots comparable. If you want to have different axes for each plot you can do so by using the scales argument.
surveys_complete %>%
group_by(year, genus) %>%
count() %>%
ggplot(mapping = aes(x = year, y = n)) +
geom_line() +
facet_wrap(facets = vars(genus), scales = "free")
The pattern of the graphs that before were hardly visible, e.g., Baiomys, is now clear as the axes have been rescaled to fit the data. This is the main advantage of using free scales. The disadvantage is that the different plots are not comparable as before.
If we would like to see if there is any difference between the sex, we can do this by adding sex as another grouping to count(). Let’s assign it to an object yearly_sex_counts.
yearly_sex_counts <- surveys_complete %>%
count(year, genus, sex)yearly_sex_counts will now look like:
#> # A tibble: 389 x 4
#> year genus sex n
#> <dbl> <chr> <chr> <int>
#> 1 1977 Chaetodipus F 3
#> 2 1977 Dipodomys F 103
#> 3 1977 Dipodomys M 119
#> 4 1977 Onychomys F 2
#> 5 1977 Onychomys M 1
#> 6 1977 Perognathus F 14
#> 7 1977 Perognathus M 8
#> 8 1977 Peromyscus M 2
#> 9 1977 Reithrodontomys F 1
#> 10 1977 Reithrodontomys M 1
#> # … with 379 more rows
This should now allow us to also split by sex. We can use colour to distinguish between the sex categories:
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, colour = sex)) +
geom_line() +
facet_wrap(facets = vars(genus))
Let us do the same thing with facet_grid() so that we can understand the difference between the two facetting techniques in the ggplot2 package. With facet_grid() you specify what variable you would like to split on as in the rows or cols arguments:
surveys_complete %>%
group_by(year, genus) %>%
count() %>%
ggplot(mapping = aes(x = year, y = n)) +
geom_line() +
# Display the genera as columns
facet_grid(cols = vars(genus))
As you can see facet_grid() placed all the categories of genus in 1 row, unlike facet_wrap() which have spread them over multiple rows to fit well in 1 page. Let us split the plots by sex as well by plotting sex as the rows:
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n)) +
geom_line() +
facet_grid(rows = vars(sex), cols = vars(genus))
More information on further functionality of facetting can be found in the
facet_wrap()andfacet_grid()documentation.
Challenge
Instead of splitting the plots based on
sexdisplay thesexas different coloured line graphs in the same plot.Answer
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, colour = sex)) + geom_line() + facet_grid(cols = vars(genus))![]()
Though the default visualisation of ggplot2 plots is already at a good standard, there are several ways one can improve even further the visualisations.
Let us start customising the last plot we have plotted by renaming the axes and adding a title to the plot. This is done by using the labs function:
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n)) +
geom_line() +
facet_grid(cols = vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of animals")
The major item that needs fixing in the plot is the text on the x-axis as this crammed and is not readable at the moment. This is mainly due to the fact that the size of the plot is dependent on the size of the window (in this case RStudio). You can work around this by saving your plot to a file and specifying the width of the plot (see Saving a plot to a file section). Themes in the ggplot2 package control the display of all non-data elements of the plot. Let us start customising the text on the x-axis by changing its size and position using the theme function. Note that theme() has several other arguments and you can read more about them in the theme() documentation.
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, colour = sex)) +
geom_line() +
facet_grid(cols = vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of animals") +
theme(axis.text.x = element_text(size = 7, angle = 90, vjust = 0.5),
axis.text.y = element_text(size = 7),
strip.text=element_text(size = 7, angle = 45))
With the plot already looking better, the last thing we would like to change is the legend. Legends are very tricky in ggplot2 as the fuction to use is determined by the data that is being displayed. In this case the legend has been created based on colour groupings. Therefore we can change the lengend title, categories and colour as follows:
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, colour = sex)) +
geom_line() +
facet_grid(cols = vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of animals") +
theme(axis.text.x = element_text(size = 7, angle = 90, vjust = 0.5),
axis.text.y = element_text(size = 7),
strip.text = element_text(size = 7, angle = 45)) +
scale_colour_brewer("Sex",
palette = "Set1",
breaks = c("F", "M"),
labels = c("Female", "Male"))
Note: If you would like to see what other palettes are available please see http://colourbrewer2.org/#type=qualitative&scheme=Set1&n=3.
ggplot2 has a set of themes that can be used to change the overall appearance of the graph without much effort. For example, if we create the first plot again and apply the theme_bw() theme we get a more simpler white background:
ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point() +
theme_bw()
A list of themes can be found in the ggplot2 documentation.
Challenge
Use what you just learned to create a plot that depicts how the average weight of each species changes through the years.
Answer
surveys_complete %>% group_by(year, species) %>% summarise(avg_weight = mean(weight)) %>% ggplot(mapping = aes(x = year, y = avg_weight)) + geom_line() + facet_wrap(facets = vars(species)) + theme_bw() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5))![]()
Now that you have learned how to use dplyr to transform your raw data, you may want to export these new datasets to share them with your collaborators or for archival.
Similar to the read_csv function used for reading CSV files into R, there is a write_csv function that generates CSV files from data frames and tibbles which is also present in the readr package.
Before using write_csv(), we are going to create a new folder, data_output, in our working directory that will store this generated dataset. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the data_output directory, so even if the files it contains are deleted, we can always re-generate them.
Let us save the surveys_complete tibble in data_output/surveys_complete.csv file:
write_csv(surveys_complete, path = "data_output/surveys_complete.csv")Clearly the topics we have covered so far are just a small fraction of what is possible in R. Below is information on some more advanced topics of data wrangling and plotting. The topics and challenges are hopefully self-explanatory!
In the data exploration workflow we discussed how to structure our data leading to the four rules defining a tidy dataset:
Here we examine the fourth rule: Each type of observational unit forms a table.
In surveys , the rows of surveys contain the values of variables associated with each record (the unit), values such as the weight or sex of each animal associated with each record. What if instead of comparing records, we wanted to compare the different mean weight of each genus between plots? (Ignoring plot_type for simplicity).
We’d need to create a new table where each row (the unit) is comprised of values of variables associated with each plot. In practical terms this means the values in genus would become the names of column variables and the cells would contain the values of the mean weight observed on each plot.
Having created a new table, it is therefore straightforward to explore the relationship between the weight of different genera within, and between, the plots. The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest: average genus weight per plot instead of recordings per date.
The opposite transformation would be to transform column names into values of a variable.
We can do both these of transformations with two tidyr functions, pivot_wider() and pivot_longer().
pivot_wider()pivot_wider() takes three principal arguments:
datanames_from column variable whose values will become new column names.values_from column variable whose values will fill the new column variables.Further arguments include values_fill which, if set, fills in missing values with the value provided.
Let’s use pivot_wider() to transform surveys to find the mean weight of each genus in each plot over the entire survey period. We use filter(), group_by() and summarise() to filter our observations and variables of interest, and create a new variable for the mean_weight. We use the pipe as before too.
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>%
group_by(plot_id, genus) %>%
summarise(mean_weight = mean(weight))
str(surveys_gw)This yields surveys_gw where the observations for each plot are spread across multiple rows, 196 observations of 3 variables. Using pivot_wider() to key on genus with values from mean_weight this becomes 24 observations of 11 variables, one row for each plot. We again use pipes:
surveys_wider <- surveys_gw %>%
pivot_wider(names_from = genus, values_from = mean_weight)
str(surveys_wider)
We could now plot comparisons between the weight of genera in different plots, although we may wish to fill in the missing values first, using the values_fill argument. Note: be careful when replacing NA with zero, since a mean_weight of would be recognised as an actual measurement, and not as a missing value!
surveys_gw %>%
pivot_wider(
names_from = genus,
values_from = mean_weight,
values_fill = list(mean_weight = 0)
)pivot_longerThe opposing situation could occur if we had been provided with data in the form of surveys_wider, where the genus names are column names, but we wish to treat them as values of a genus variable instead.
In this situation we are lenghtening the data by taking the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names.
pivot_longer() takes four principal arguments:
datacols containing the names of the columns we want to pivot into a longer formatnames_to containing a string specifying the name of the column to create from the data stored in the column names of datavalues_tocontaining a string specifying the name of the column to create from the data stored in cell values.To recreate surveys_gw from surveys_wider we would create a column using names_to called genus. We would also need to create a column called mean_weight using the values_to argument. Lastly, we would use all columns apart from plot_id to pivot into a longer format. Here we drop plot_id column with a minus sign.
surveys_longer <- surveys_wider %>%
pivot_longer(
cols = -plot_id,
names_to = "genus",
values_to = "mean_weight")
str(surveys_longer)
Note that now the NA genera are included in the longer format. Making data wider and then longer can be a useful way to balance out a dataset so every replicate has the same composition.
We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it’s easier to specify what to gather than what to leave alone. And if the columns are in a row, we don’t even need to list them all out - just use the : operator!
surveys_wider %>%
pivot_longer(
cols = Baiomys:Spermophilus,
names_to = "genus",
values_to = "mean_weight") %>%
head()Challenge
- Take the
surveysdata frame and make it wider, usingyearas columns,plot_idas rows, and the number of genera per plot as the values. You will need to summarise before reshaping, and use the functionn_distinct()to get the number of unique genera within a particular chunk of data. It’s a powerful function! See?n_distinctfor more.Answer
surveys_wider_genera <- surveys %>% group_by(plot_id, year) %>% summarise(n_genera = n_distinct(genus)) %>% pivot_wider(names_from = year, values_from = n_genera) head(surveys_wider_genera)
- Now take that data frame and
pivot_longer()it again, so each row is a uniqueplot_idbyyearcombination.Answer
surveys_wider_genera %>% pivot_longer( cols = -plot_id, names_to = "year", values_to = "n_genera")
- The
surveysdata set has two measurement columns:hindfoot_lengthandweight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let’s walk through a common solution for this type of problem. First, usepivot_longer()to create a dataset where we have a key column calledmeasurementand avaluecolumn that takes on the value of eitherhindfoot_lengthorweight. Hint: You’ll need to specify which columns are being pivoted.Answer
surveys_long <- surveys %>% pivot_longer( cols = c(hindfoot_length, weight), names_to = "measurement", values_to = "value")
- With this new data set, calculate the average of each
measurementin eachyearfor each differentplot_type. Then make the data set wider usingpivot_wider()with a column forhindfoot_lengthandweight. Hint: You only need to specify the name and value columns forpivot_wider().Answer
surveys_long %>% group_by(year, measurement, plot_type) %>% summarise(mean_value = mean(value, na.rm = TRUE)) %>% pivot_wider(names_from = measurement, values_from = mean_value)
Often data is spread out over multiple tables, instead of in one large, single table. Why collect data in multiple tables? - Not efficient to include all information in a single table. - Redundant information makes it more difficult to update or revise data. - Make changes in one place, not hundreds of places. - Use multiple tables - Each table contains a single kind of information, for example in our case we could split the data into three tables: - surveys: information about individuals - species: information about species - plots: information about plots - If a species name changes we only need to change it in the species table - Connect tables using joins to describe relationships between tables
To illustrate the joining of different tables we will create two tables: hindfoot_length_join and weight_join, which contain the average hindfoot_length and weight per genus, respectively. To simplify the example, we are only using three genera, two of which are common between the tables. These are taken using the slice function. For more information, see ?slice.
# Create a summary table for mean hindfoot_length
hindfoot_length_join <- surveys %>%
group_by(genus) %>%
summarise(hindfoot_length = mean(hindfoot_length, na.rm = TRUE)) %>%
slice(4,8,11)
# Create a summary table for mean weight
weight_join <- surveys %>%
group_by(genus) %>%
summarise(weight = mean(weight, na.rm = TRUE)) %>%
slice(4,8,15)
hindfoot_length_join#> # A tibble: 3 x 2
#> genus hindfoot_length
#> <chr> <dbl>
#> 1 Baiomys 13
#> 2 Chaetodipus 23.9
#> 3 Dipodomys 37.9
weight_join#> # A tibble: 3 x 2
#> genus weight
#> <chr> <dbl>
#> 1 Baiomys 8.6
#> 2 Chaetodipus 24.2
#> 3 Perognathus 8.38
As can be seen above, the Baiomys and Chaetodipus genera are common between the two tables. Dipodomys only occurs in hindfoot_length_join and Perognathus only in weight_join.
Joining of tables is done using the mutating join functions in dplyr documentation. All of join functions take three arguments, two table names (the tables to join, the left-hand side and right-hand side) and a common key defined in by =. Here we look at the four most commonly used join functions, which are illustrated in the image below.
left-join. This joins matching rows from the right-hand side table to the left-hand side table.right_join. This joins matching rows from the left-hand side table to the right-hand side table.inner_join. This retains only rows that occur in both tables.full_join. This retains all values from all rows.In this case we want to join the data by genus, so we use by = genus as the common key.